Resampling Methods


Cross-Validation and the Bootstrap


Training Error versus Test Error


Validation-Set Approach


library(tidyverse)
library(ISLR2)
library(rsample)

#set seed
set.seed(1234)


#data
Auto <- ISLR2::Auto


#Split the data into training and validation
split_auto  <- initial_split(Auto, prop = 0.5)
train_auto  <- training(split_auto)
test_auto   <- testing(split_auto)


#estimate a linear model
lm.fit <- lm(mpg ~ horsepower, data = train_auto)
summary(lm.fit)


#MSE
pred.fit <- predict(lm.fit, newdata = test_auto, type = "response")
mean((test_auto$mpg - pred.fit)^2)

#Therefore, the estimated test MSE for the linear regression fit is about 23.94 (the exact value depends on the random training/validation split)

K-Fold Cross-Validation

  • Widely used approach for estimating test error

  • Estimates can be used to select best model, and to give an idea of the test error of the final chosen model

  • Idea is to randomly divide the data into \(K\) equal-sized parts. We leave out part \(k\), fit the model to the other \(K − 1\) parts (combined), and then obtain predictions for the left-out \(kth\) part

  • This is done in turn for each part \(k = 1, 2, ..., K\), and then the results are combined

  • Example

    • Divide data into \(K\) roughly equal-sized parts (\(K = 5\) for this example)


  • Let the \(K\) parts be \(C_1, C_2, ..., C_K\), where \(C_k\) denotes the indices of the observations in part \(k\)

    • There are \(n_k\) observations in part \(k\): if \(n\) is a multiple of \(K\), then \(n_k = n/K\)

    • Compute \(CV_{(K)} = \sum_{k=1}^{K} \frac{n_k}{n} MSE_k\)


      • where \(MSE_k = \sum_{i \in C_k} (y_i - \hat{y_i})^2 / n_k\) and \(\hat{y_i}\) is the fit for observation \(i\), obtained from the data with part \(k\) removed


      • Setting \(K = n\) yields \(n\)-fold or leave-one-out cross-validation (LOOCV)


  • With least-squares linear or polynomial regression, an amazing shortcut makes the cost of LOOCV the same as that of a single model fit! The following formula holds

    • \(CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} (\frac{y_i-\hat{y_i}}{1-h_i})^2\)

    • Where \(y_i\) is the \(ith\) response, \(\hat{y_i}\) is the \(ith\) fitted value from the original least squares fit, and \(h_i\) is the leverage (see the sketch after this list)

      • This is like the ordinary MSE, except the \(ith\) residual is divided by \(1 - h_i\)
    • LOOCV is sometimes useful, but typically doesn’t shake up the data enough. The estimates from each fold are highly correlated and hence their average can have high variance

    • A better choice is \(K = 5\) or 10
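
A minimal sketch of the leverage shortcut in R, using the Auto data loaded above; the cross-check with boot::cv.glm (whose default K is n, i.e. LOOCV) is an extra step of this sketch, not part of the notes:

#LOOCV via the leverage shortcut: one least-squares fit, no refitting
fit.full <- lm(mpg ~ horsepower, data = Auto)
h <- hatvalues(fit.full)                        #leverages h_i
mean((residuals(fit.full) / (1 - h))^2)         #CV_(n)


#brute-force cross-check: cv.glm with its default K = n (i.e. LOOCV)
glm.full <- glm(mpg ~ horsepower, data = Auto)
boot::cv.glm(Auto, glm.full)$delta[1]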

library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
#set seed
set.seed(1234)


#data
Auto <- ISLR2::Auto


#specify the cross-validation method
ctrl <- trainControl(method = "cv", number = 10)


#fit a regression model and use k-fold CV to evaluate performance
model.fit <- train(mpg ~ horsepower, data = Auto, method = "lm", trControl = ctrl)


#view summary of k-fold CV               
print(model.fit)


#view final model
model.fit$finalModel


#view resampling performance (RMSE, R-squared, MAE) for each fold
model.fit$resample

Other Issues with Cross-Validation


Cross-Validation for Classification Problems

  • We divide the data into \(K\) roughly equal-sized parts \(C_1, C_2, ..., C_K\)

    • Where \(C_k\) denotes the indices of the observations in part \(k\)

    • There are \(n_k\) observations in part \(k\): if \(n\) is a multiple of \(K\), then \(n_k = n/K\)

    • Compute

      • \(CV_{(K)} = \sum_{k=1}^{K} \frac{n_k}{n} Err_k\)

        • Where \(Err_k = \sum_{i \in C_k} I(y_i \neq \hat{y_i}) / n_k\)
      • The estimated standard deviation of \(CV_K\) is

        • \(\hat{SE}(CV_K) = \sqrt{\frac{1}{K} \sum_{k=1}^{K} \frac{(Err_k-\bar{Err_k})^2}{K-1}}\)

        • This is a useful estimate, but strictly speaking, not quite valid (a manual sketch of these formulas follows this list)
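
A minimal manual sketch of these formulas in R, using logistic regression on the Default data from ISLR2; the predictors, the 0.5 cutoff, and K = 10 are illustrative choices, not from the notes:

#manual 10-fold CV error rate for a logistic regression on Default
set.seed(1234)
Default <- ISLR2::Default

K <- 10
folds <- sample(rep(1:K, length.out = nrow(Default)))   #random fold labels

#Err_k: misclassification rate on each held-out fold
err.k <- sapply(1:K, function(k){
  train <- Default[folds != k, ]
  test  <- Default[folds == k, ]
  fit   <- glm(default ~ balance + income, data = train, family = binomial)
  prob  <- predict(fit, newdata = test, type = "response")
  pred  <- ifelse(prob > 0.5, "Yes", "No")
  mean(pred != test$default)
})

#CV_(K): weighted average of the Err_k
n.k <- as.numeric(table(folds))
sum(n.k / nrow(Default) * err.k)

#estimated standard error of CV_K, as in the formula above
sqrt(sum((err.k - mean(err.k))^2 / (K - 1)) / K)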


Cross-Validation: Right and Wrong



The Bootstrap


A Simple Example

  • Suppose that we wish to invest a fixed sum of money in two financial assets that yield returns of X and Y, respectively, where X and Y are random quantities

  • We will invest a fraction \(\alpha\) of our money in X, and will invest the remaining \(1 - \alpha\) in Y

  • We wish to choose \(\alpha\) to minimize the total risk, or variance, of our investment. In other words, we want to minimize \(Var(\alpha X + (1-\alpha)Y)\)

  • One can show that the value that minimizes the risk is given by

    • \(\alpha = \frac{\sigma_{Y}^{2}-\sigma_{XY}}{\sigma_{X}^{2} + \sigma_{Y}^{2}-2\sigma_{XY}}\)

      • Where \(\sigma_{X}^{2} = Var(X)\), \(\sigma_{Y}^{2} = Var(Y)\), and \(\sigma_{XY} = Cov(X,Y)\)

      • But the values of \(\sigma_{X}^{2}\), \(\sigma_{Y}^{2}\), and \(\sigma_{XY}\) are unknown

        • We can compute estimates for these quantities, \(\hat{\sigma}_X^2\), \(\hat{\sigma}_Y^2\), and \(\hat{\sigma}_{XY}\), using a data set that contains measurements for X and Y

        • We can then estimate the value of \(\alpha\) that minimizes the variance of our investment

          • \(\hat{\alpha} = \frac{\hat{\sigma}_Y^2 - \hat{\sigma}_{XY}}{\hat{\sigma}_X^2 + \hat{\sigma}_Y^2 - 2\hat{\sigma}_{XY}}\)


  • Each panel displays 100 simulated returns for investments X and Y. From left to right and top to bottom, the resulting estimates for \(\alpha\) are 0.576, 0.532, 0.657, and 0.651

  • To estimate the standard deviation of \(\hat{\alpha}\), we repeated the process of simulating 100 paired observations of X and Y, and estimating \(\alpha\), 1,000 times

  • We thereby obtained 1,000 estimates for \(\alpha\), which we can call \(\hat{\alpha_1}, \hat{\alpha_2}, ..., \hat{\alpha_{1000}}\)

  • The left-hand panel of the figure below displays a histogram of the resulting estimates

    • For these simulations the parameters were set to \(\sigma_X^2 = 1\), \(\sigma_Y^2 = 1.25\), \(\sigma_{XY} = 0.5\) and we know that the true value of \(\alpha\) is 0.6 (indicated by the red line)
  • On the left side is a histogram of the estimates of \(\alpha\) obtained by generating 1,000 simulated data sets from the true population.

  • In the center is a histogram of the estimates of \(\alpha\) obtained from 1,000 bootstrap samples from a single data set.

  • On the right, the estimates of \(\alpha\) displayed in the left and center panels are shown as boxplots. In each panel, the pink line indicates the true value of \(\alpha\)


  • The mean over all 1,000 estimates for \(\alpha\) is \(\bar{\alpha} = \frac{1}{1000} \sum_{r=1}^{1000} \hat{\alpha}_r = 0.5996\)

    • This is very close to the true value \(\alpha = 0.6\), and the standard deviation of the estimates is

      • \(\sqrt{\frac{1}{1000-1} \sum_{r=1}^{1000} (\hat{\alpha}_r - \bar{\alpha})^2} = 0.083\)
    • This gives us a very good idea of the accuracy of \(\hat{\alpha}\): \(SE(\hat{\alpha}) \approx 0.083\)

    • So roughly speaking, for a random sample from the population, we would expect \(\hat{\alpha}\) to differ from \(\alpha\) by approximately 0.08, on average (a simulation sketch follows below)
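
A minimal sketch of this simulation in R. The covariance parameters are the ones stated above; MASS::mvrnorm and the zero means are assumptions of the sketch:

#repeat the population-sampling experiment: 1,000 data sets of 100 paired returns each
set.seed(1234)
Sigma <- matrix(c(1, 0.5, 0.5, 1.25), nrow = 2)   #Var(X) = 1, Cov(X,Y) = 0.5, Var(Y) = 1.25

alpha.sim <- replicate(1000, {
  xy <- MASS::mvrnorm(100, mu = c(0, 0), Sigma = Sigma)
  X <- xy[, 1]
  Y <- xy[, 2]
  (var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 * cov(X, Y))
})

mean(alpha.sim)   #close to the true alpha = 0.6
sd(alpha.sim)     #approximates SE of alpha-hat, roughly 0.08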

library(boot)
## 
## Attaching package: 'boot'
## The following object is masked from 'package:lattice':
## 
##     melanoma
#set seed
set.seed(1234)


#data
Auto <- ISLR2::Auto


#fit a linear model
fit.lm1 <- lm(mpg ~ horsepower, data = Auto)
summary(fit.lm1)


#bootstrap statistic: refit the model on the bootstrap-sampled rows given by index
boot.fn <- function(data, index){
  fit.lm2 <- lm(mpg ~ horsepower, data = data[index, ])
  return(coef(fit.lm2))
}


#bootstrap
bootstrap.coef <- boot(data = Auto, boot.fn, R = 1000)
bootstrap.coef

What About in the Real World?

  • The procedure outlined above cannot be applied, because for real data we cannot generate new samples from the original population

  • However, the bootstrap approach allows us to use a computer to mimic the process of obtaining new data sets, so that we can estimate the variability of our estimate without generating additional samples

  • Rather than repeatedly obtaining independent data sets from the population, we instead obtain distinct data sets by repeatedly sampling observations from the original data set with replacement

  • Each of these “bootstrap data sets” is created by sampling with replacement, and is the same size as our original dataset. As a result some observations may appear more than once in a given bootstrap data set and some not at all

  • An example with just 3 observations


  • The figure above gives a graphical illustration of the bootstrap approach on a small sample containing n = 3 observations. Each bootstrap data set contains n observations, sampled with replacement from the original data set. Each bootstrap data set is used to obtain an estimate of \(\alpha\)
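
A minimal sketch of this resampling in R, using the Portfolio data from ISLR2 (which records the returns X and Y) and the boot() function loaded above:

#data: 100 pairs of returns X and Y
Portfolio <- ISLR2::Portfolio


#statistic for boot(): alpha-hat computed on the bootstrap-sampled rows given by index
alpha.fn <- function(data, index){
  X <- data$X[index]
  Y <- data$Y[index]
  (var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 * cov(X, Y))
}


#1,000 bootstrap data sets
set.seed(1234)
boot.alpha <- boot(data = Portfolio, statistic = alpha.fn, R = 1000)
boot.alpha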

The Bootstrap in General

  • In more complex data situations, figuring out the appropriate way to generate bootstrap samples can require some thought

  • For example, if the data is a time series, we can’t simply sample the observations with replacement

  • We can instead create blocks of consecutive observations, and sample those with replacement. Then we paste together the sampled blocks to obtain a bootstrap dataset
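
A minimal sketch of the block idea in R. The AR(1) toy series, the block length of 10, and the sampled block starts are all illustrative assumptions:

#one block-bootstrap resample of a toy time series
set.seed(1234)
y <- as.numeric(arima.sim(model = list(ar = 0.7), n = 120))

block.len <- 10
n.blocks  <- length(y) / block.len

#sample block starting points with replacement, then paste the blocks together
starts <- sample(1:(length(y) - block.len + 1), size = n.blocks, replace = TRUE)
y.star <- unlist(lapply(starts, function(s) y[s:(s + block.len - 1)]))
length(y.star)   #same length as the original series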


Other Uses of the Bootstrap

  • Primarily used to obtain standard errors of an estimate

  • Also provides approximate confidence intervals for a population parameter

  • An interval formed from the percentiles of the bootstrap distribution of the estimate (for example, the 2.5th and 97.5th percentiles of the 1,000 bootstrap values of \(\hat{\alpha}\) give an approximate 95% interval) is called a Bootstrap Percentile confidence interval. It is the simplest method (among many approaches) for obtaining a confidence interval from the bootstrap
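
A minimal sketch in R, assuming the boot.alpha object from the Portfolio sketch above; the 95% level is an illustrative choice:

#bootstrap percentile confidence interval for alpha
boot.ci(boot.alpha, conf = 0.95, type = "perc")

#essentially the same interval: the 2.5th and 97.5th percentiles of the bootstrap estimates
quantile(boot.alpha$t, probs = c(0.025, 0.975))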


Can the Bootstrap Estimate Prediction Error?

  • In cross-validation, each of the K validation folds is distinct from the other K − 1 folds used for training: there is no overlap. This is crucial for its success

  • To estimate prediction error using the bootstrap, we could think about using each bootstrap dataset as our training sample, and the original sample as our validation sample

  • But each bootstrap sample has significant overlap with the original data. About two-thirds of the original data points appear in each bootstrap sample (a quick numerical check appears after this list)

  • This will cause the bootstrap to seriously underestimate the true prediction error

  • The other way around (original sample as the training set, bootstrap dataset as the validation set) is worse!

  • Removing the overlap

    • Can partly fix this problem by only using predictions for those observations that did not (by chance) occur in the current bootstrap sample

    • But the method gets complicated, and in the end, cross-validation provides a simpler, more attractive approach for estimating prediction error
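
A quick numerical check of the two-thirds overlap; n = 392 matches nrow(Auto), but any reasonably large n gives a similar answer:

#fraction of distinct original observations appearing in a bootstrap sample
set.seed(1234)
n <- 392
mean(replicate(1000, length(unique(sample(1:n, replace = TRUE))) / n))
1 - (1 - 1/n)^n   #theoretical value, approaching 1 - exp(-1), about 0.632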


Pre-Validation


The Bootstrap versus Permutation Tests

  • The bootstrap samples from the estimated population, and uses the results to estimate standard errors and confidence intervals

  • Permutation methods sample from an estimated null distribution for the data, and use this to estimate p-values and False Discovery Rates for hypothesis tests

  • The bootstrap can be used to test a null hypothesis in simple situations. E.g., if \(\theta = 0\) is the null hypothesis, we check whether the confidence interval for \(\theta\) contains zero

  • Can also adapt the bootstrap to sample from a null distribution (see the Efron and Tibshirani book "An Introduction to the Bootstrap" (1993), Chapter 16), but there's no real advantage over permutations